Motor vehicle collisions reported by the New York City Police Department from January-August 2020. Each record represents an individual collision, including the date, time and location of the accident (borough, zip code, street name, latitude/longitude), vehicles and victims involved, and contributing factors.
Data source: https://www.mavenanalytics.io/data-playground
import pandas as pd
import numpy as np
from pandas import Timestamp
import datetime as dt
from datetime import datetime
df= pd.read_csv("NYCAccidents2020.csv")
df.head(5)
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-29 | 15:40:00 | BRONX | 10466.0 | 40.89210 | -73.833760 | POINT (-73.83376 40.8921) | PRATT AVENUE | STRANG AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 4342908 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
| 1 | 2020-08-29 | 21:00:00 | BROOKLYN | 11221.0 | 40.69050 | -73.919914 | POINT (-73.919914 40.6905) | BUSHWICK AVENUE | PALMETTO STREET | NaN | ... | Unspecified | NaN | NaN | NaN | 4343555 | Sedan | Sedan | NaN | NaN | NaN |
| 2 | 2020-08-29 | 18:20:00 | NaN | NaN | 40.81650 | -73.946556 | POINT (-73.946556 40.8165) | 8 AVENUE | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4343142 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
| 3 | 2020-08-29 | 00:00:00 | BRONX | 10459.0 | 40.82472 | -73.892960 | POINT (-73.89296 40.82472) | NaN | NaN | 1047 SIMPSON STREET | ... | Unspecified | Unspecified | Unspecified | NaN | 4343588 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | Sedan | Motorcycle | NaN |
| 4 | 2020-08-29 | 17:10:00 | BROOKLYN | 11203.0 | 40.64989 | -73.933890 | POINT (-73.93389 40.64989) | NaN | NaN | 4609 SNYDER AVENUE | ... | Unspecified | NaN | NaN | NaN | 4342953 | Sedan | Sedan | NaN | NaN | NaN |
5 rows × 29 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 74881 entries, 0 to 74880 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CRASH DATE 74881 non-null object 1 CRASH TIME 74881 non-null object 2 BOROUGH 49140 non-null object 3 ZIP CODE 49134 non-null float64 4 LATITUDE 68935 non-null float64 5 LONGITUDE 68935 non-null float64 6 LOCATION 68935 non-null object 7 ON STREET NAME 55444 non-null object 8 CROSS STREET NAME 35681 non-null object 9 OFF STREET NAME 19437 non-null object 10 NUMBER OF PERSONS INJURED 74881 non-null int64 11 NUMBER OF PERSONS KILLED 74881 non-null int64 12 NUMBER OF PEDESTRIANS INJURED 74881 non-null int64 13 NUMBER OF PEDESTRIANS KILLED 74881 non-null int64 14 NUMBER OF CYCLIST INJURED 74881 non-null int64 15 NUMBER OF CYCLIST KILLED 74881 non-null int64 16 NUMBER OF MOTORIST INJURED 74881 non-null int64 17 NUMBER OF MOTORIST KILLED 74881 non-null int64 18 CONTRIBUTING FACTOR VEHICLE 1 74577 non-null object 19 CONTRIBUTING FACTOR VEHICLE 2 59285 non-null object 20 CONTRIBUTING FACTOR VEHICLE 3 6765 non-null object 21 CONTRIBUTING FACTOR VEHICLE 4 1851 non-null object 22 CONTRIBUTING FACTOR VEHICLE 5 523 non-null object 23 COLLISION_ID 74881 non-null int64 24 VEHICLE TYPE CODE 1 74246 non-null object 25 VEHICLE TYPE CODE 2 53638 non-null object 26 VEHICLE TYPE CODE 3 6424 non-null object 27 VEHICLE TYPE CODE 4 1771 non-null object 28 VEHICLE TYPE CODE 5 503 non-null object dtypes: float64(3), int64(9), object(17) memory usage: 16.6+ MB
We need to make sure the data is clean before starting the analysis. As a reminder, we should check for duplicate records, missing values, and incorrect data types.
How many duplicate collision records are there?
dup_rows = df.duplicated().sum()
dup_rows
0
Drop the duplicated records.
# your code here
df = df.drop_duplicates()
df.head()
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-29 | 15:40:00 | BRONX | 10466.0 | 40.89210 | -73.833760 | POINT (-73.83376 40.8921) | PRATT AVENUE | STRANG AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 4342908 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
| 1 | 2020-08-29 | 21:00:00 | BROOKLYN | 11221.0 | 40.69050 | -73.919914 | POINT (-73.919914 40.6905) | BUSHWICK AVENUE | PALMETTO STREET | NaN | ... | Unspecified | NaN | NaN | NaN | 4343555 | Sedan | Sedan | NaN | NaN | NaN |
| 2 | 2020-08-29 | 18:20:00 | NaN | NaN | 40.81650 | -73.946556 | POINT (-73.946556 40.8165) | 8 AVENUE | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4343142 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
| 3 | 2020-08-29 | 00:00:00 | BRONX | 10459.0 | 40.82472 | -73.892960 | POINT (-73.89296 40.82472) | NaN | NaN | 1047 SIMPSON STREET | ... | Unspecified | Unspecified | Unspecified | NaN | 4343588 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | Sedan | Motorcycle | NaN |
| 4 | 2020-08-29 | 17:10:00 | BROOKLYN | 11203.0 | 40.64989 | -73.933890 | POINT (-73.93389 40.64989) | NaN | NaN | 4609 SNYDER AVENUE | ... | Unspecified | NaN | NaN | NaN | 4342953 | Sedan | Sedan | NaN | NaN | NaN |
5 rows × 29 columns
# Convert 'CRASH DATE' and 'CRASH TIME' from object to datetime
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'])
df['CRASH TIME'] = pd.to_datetime(df['CRASH TIME'],format='%H:%M:%S').dt.time
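The same two conversions can be sanity-checked on a tiny throwaway frame (the toy values below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame mimicking the raw CRASH DATE / CRASH TIME columns
toy = pd.DataFrame({
    "CRASH DATE": ["2020-08-29", "2020-01-15"],
    "CRASH TIME": ["15:40:00", "00:05:00"],
})

# Same conversions as above: dates to datetime64, times to datetime.time
toy["CRASH DATE"] = pd.to_datetime(toy["CRASH DATE"])
toy["CRASH TIME"] = pd.to_datetime(toy["CRASH TIME"], format="%H:%M:%S").dt.time

print(toy["CRASH DATE"].dtype)    # datetime64[ns]
print(toy["CRASH TIME"].iloc[0])  # 15:40:00
```

Keeping the date as `datetime64` (rather than a plain string) is what later makes the `.dt` accessors and `pd.Grouper` resampling possible.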
How many missing values are there?
df.isnull().sum() # Count the number of nulls in each column
CRASH DATE 0 CRASH TIME 0 BOROUGH 25741 ZIP CODE 25747 LATITUDE 5946 LONGITUDE 5946 LOCATION 5946 ON STREET NAME 19437 CROSS STREET NAME 39200 OFF STREET NAME 55444 NUMBER OF PERSONS INJURED 0 NUMBER OF PERSONS KILLED 0 NUMBER OF PEDESTRIANS INJURED 0 NUMBER OF PEDESTRIANS KILLED 0 NUMBER OF CYCLIST INJURED 0 NUMBER OF CYCLIST KILLED 0 NUMBER OF MOTORIST INJURED 0 NUMBER OF MOTORIST KILLED 0 CONTRIBUTING FACTOR VEHICLE 1 304 CONTRIBUTING FACTOR VEHICLE 2 15596 CONTRIBUTING FACTOR VEHICLE 3 68116 CONTRIBUTING FACTOR VEHICLE 4 73030 CONTRIBUTING FACTOR VEHICLE 5 74358 COLLISION_ID 0 VEHICLE TYPE CODE 1 635 VEHICLE TYPE CODE 2 21243 VEHICLE TYPE CODE 3 68457 VEHICLE TYPE CODE 4 73110 VEHICLE TYPE CODE 5 74378 dtype: int64
# Drop rows with missing values in the 'LOCATION' and 'CONTRIBUTING FACTOR VEHICLE 1' columns
df = df[df['LOCATION'].notna()]
df = df[df['CONTRIBUTING FACTOR VEHICLE 1'].notna()]
# Standardize all column names to snake_case
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 68647 entries, 0 to 74880 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 crash_date 68647 non-null datetime64[ns] 1 crash_time 68647 non-null object 2 borough 47517 non-null object 3 zip_code 47511 non-null float64 4 latitude 68647 non-null float64 5 longitude 68647 non-null float64 6 location 68647 non-null object 7 on_street_name 49872 non-null object 8 cross_street_name 32734 non-null object 9 off_street_name 18775 non-null object 10 number_of_persons_injured 68647 non-null int64 11 number_of_persons_killed 68647 non-null int64 12 number_of_pedestrians_injured 68647 non-null int64 13 number_of_pedestrians_killed 68647 non-null int64 14 number_of_cyclist_injured 68647 non-null int64 15 number_of_cyclist_killed 68647 non-null int64 16 number_of_motorist_injured 68647 non-null int64 17 number_of_motorist_killed 68647 non-null int64 18 contributing_factor_vehicle_1 68647 non-null object 19 contributing_factor_vehicle_2 54520 non-null object 20 contributing_factor_vehicle_3 6200 non-null object 21 contributing_factor_vehicle_4 1704 non-null object 22 contributing_factor_vehicle_5 489 non-null object 23 collision_id 68647 non-null int64 24 vehicle_type_code_1 68274 non-null object 25 vehicle_type_code_2 49155 non-null object 26 vehicle_type_code_3 5885 non-null object 27 vehicle_type_code_4 1629 non-null object 28 vehicle_type_code_5 471 non-null object dtypes: datetime64[ns](1), float64(3), int64(9), object(16) memory usage: 15.7+ MB
import calendar
df['year_month'] = df['crash_date'].dt.strftime('%Y-%m')
df['year'] = df['crash_date'].dt.strftime('%Y')
df['month'] = df['crash_date'].dt.month
df['month_'] = df['month'].apply(lambda x: calendar.month_name[x])
df['weekday'] = df['crash_date'].dt.day_name()
df['day'] = df['crash_date'].dt.strftime('%d')
df['accident_count']=1
Data cleaning finished! 👏 Now we are ready to answer questions and draw conclusions from our data. 👌 🍀
!pip install chart_studio
Collecting chart_studio Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB) Collecting retrying>=1.3.3 Using cached retrying-1.3.3.tar.gz (10 kB) Requirement already satisfied: six in c:\anaconda3\anaconda3\lib\site-packages (from chart_studio) (1.16.0) Requirement already satisfied: plotly in c:\anaconda3\anaconda3\lib\site-packages (from chart_studio) (5.4.0) Requirement already satisfied: requests in c:\anaconda3\anaconda3\lib\site-packages (from chart_studio) (2.26.0) Requirement already satisfied: tenacity>=6.2.0 in c:\anaconda3\anaconda3\lib\site-packages (from plotly->chart_studio) (8.0.1) Requirement already satisfied: certifi>=2017.4.17 in c:\anaconda3\anaconda3\lib\site-packages (from requests->chart_studio) (2021.10.8) Requirement already satisfied: charset-normalizer~=2.0.0 in c:\anaconda3\anaconda3\lib\site-packages (from requests->chart_studio) (2.0.4) Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\anaconda3\anaconda3\lib\site-packages (from requests->chart_studio) (1.26.7) Requirement already satisfied: idna<4,>=2.5 in c:\anaconda3\anaconda3\lib\site-packages (from requests->chart_studio) (3.2) Building wheels for collected packages: retrying Building wheel for retrying (setup.py): started Building wheel for retrying (setup.py): finished with status 'done' Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11448 sha256=a49314e6e3f86b9c98c4f5323b3d9a3e10176786a4a73364ca6aa6ac792e9132 Stored in directory: c:\users\melody yip\appdata\local\pip\cache\wheels\ce\18\7f\e9527e3e66db1456194ac7f61eb3211068c409edceecff2d31 Successfully built retrying Installing collected packages: retrying, chart-studio Successfully installed chart-studio-1.1.0 retrying-1.3.3
import chart_studio
username = 'melodyyip' # your username
api_key = 'your_api_key_here' # your api key - go to profile > settings > regenerate key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
How many accidents were registered by the police in New York City in 2020 by date?
import plotly.express as px
dfg = df[['crash_date','crash_time']]
dfg = dfg.groupby(['crash_date']).count().reset_index()
dfg.rename(columns = {"crash_time": "count"}, inplace=True)
fig = px.line(dfg, x="crash_date", y="count"
,hover_data={"crash_date": "|%B %d, %Y"}
,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template = "plotly_white"
)
fig.update_layout(
title="Number of accidents by Date (Jan-Aug 2020)"
,xaxis_title="Date"
,yaxis_title="Number of accidents"
)
fig.update_xaxes(
dtick="M1",
tickformat="%b\n%Y",
ticklabelmode="period")
fig.show()
# fig.write_html("NYCaccidents-fig1-NumOfAccByDate.html")
import chart_studio.plotly as py
py.plot(fig, filename = 'NYCaccidents-fig1-NumOfAccByDate', auto_open=True)
'https://plotly.com/~melodyyip/1/'
import plotly.express as px
dfgg = df[['crash_date','crash_time']]
dfgg = dfgg.groupby(pd.Grouper(key='crash_date',freq='M')).count().reset_index()
dfgg.rename(columns = {"crash_time": "count"}, inplace=True)
fig = px.line(dfgg, x="crash_date", y="count"
,hover_data={"crash_date": "|%B %d, %Y"}
,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template = "plotly_white"
)
fig.update_layout(
title="Number of accidents by Month (Jan-Aug 2020)"
,xaxis_title="Month"
,yaxis_title="Number of accidents"
)
fig.update_xaxes(
dtick="M1",
tickformat="%b\n%Y",
ticklabelmode="period")
fig.show()
df_stack = df.groupby(df['crash_date'].dt.strftime('%B'))['location'].count().reset_index()
accident_sum = df['location'].count()
df_stack['year']="2020"
df_stack['Percentage']=(df_stack['location']/accident_sum)*100
df_stack.rename(columns = {"location": "count"}, inplace=True)
df_stack.rename(columns = {"crash_date": "month"}, inplace=True)
df_stack
from pandas.api.types import CategoricalDtype
cat_month = CategoricalDtype(["January","February","March","April","May","June","July","August","September","October","November","December"],ordered=True)
df_stack["month"] = df_stack["month"].astype(cat_month)
df_stack.sort_values(["year", "month"])
| month | count | year | Percentage | |
|---|---|---|---|---|
| 3 | January | 13106 | 2020 | 19.091876 |
| 2 | February | 12498 | 2020 | 18.206185 |
| 6 | March | 10177 | 2020 | 14.825120 |
| 0 | April | 3773 | 2020 | 5.496234 |
| 7 | May | 5651 | 2020 | 8.231969 |
| 5 | June | 6977 | 2020 | 10.163591 |
| 4 | July | 8445 | 2020 | 12.302067 |
| 1 | August | 8020 | 2020 | 11.682958 |
fig3=px.bar(df_stack
,x='Percentage'
,y='year'
,color = 'month'
, orientation='h'
, barmode = 'stack',color_discrete_sequence=px.colors.diverging.PRGn,text=df_stack['Percentage'].map('{:,.2f}%'.format)
,category_orders={"month": ["January", "February", "March", "April","May","June","July","August"]}
)
fig3.update_layout(title = "Percentage share of Accident by Month",
template = 'simple_white', xaxis_title = '%',
yaxis_title = 'year',
legend_title_text='Month')
import chart_studio.plotly as py
# py.plot(fig3, filename = 'NYCaccidents-fig2-PerShareOfAccidentByMonth', auto_open=True)
'https://plotly.com/~melodyyip/3/'
As we can observe, the number of accidents decreases sharply from the end of February. One reason could be that fewer people were driving to work in these months due to COVID-19.
As we did with months, we can analyze the distribution of car accidents by day of the week, again using a bar plot.
## Get the hour from crash_time (the first two characters of the HH:MM:SS string)
df['hour'] = df['crash_time'].astype(str).str[:2]
dfg = df[['hour','month_','location']]
dfg = dfg.groupby(['hour','month_']).count().reset_index()
dfg.rename(columns = {"location": "count"}, inplace=True)
dfg
fig = px.bar(dfg, x="hour", y="count",color='month_'
# ,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template = "plotly_white"
,category_orders={"month_": ["January", "February", "March", "April","May","June","July","August"]}
)
fig.update_layout(
title="Number of accidents by Hour (Jan-Aug 2020)"
,xaxis_title="Hour of day"
,yaxis_title="Number of accidents"
,legend_title_text='Month'
)
fig.show()
# py.plot(fig, filename = 'NYCaccidents-fig3-NumOfAccidentByTime', auto_open=True)
As we can observe in the plot, the greatest number of accidents occurs in the afternoon, between hours 14 and 18. Accidents also tend to be more severe in the evening.
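The peak window can also be read off programmatically instead of by eye; a minimal sketch using `idxmax` on illustrative hourly counts (the numbers below are made up, not taken from the dataset):

```python
import pandas as pd

# Illustrative accident counts per hour label (made-up numbers)
counts = pd.Series(
    {"08": 310, "12": 390, "14": 520, "16": 540, "17": 530, "22": 260},
    name="count",
)

peak_hour = counts.idxmax()     # hour label with the highest count
print(peak_hour, counts.max())  # 16 540
```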
dfg = df[['weekday','month_','location','crash_date']]
dfg = dfg.groupby(['weekday','month_','crash_date']).count().reset_index()
dfg.rename(columns = {"location": "count"}, inplace=True)
dfg = dfg.groupby(['weekday','month_'])['count'].mean().reset_index()
dfg
fig = px.bar(dfg, x="weekday", y="count",color='month_'
# ,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template = "plotly_white"
,category_orders={"weekday": ["Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday","Sunday"]
,"month_": ["January", "February", "March", "April","May","June","July","August"]
}
)
fig.update_layout(
title="Number of accidents by Weekday (Jan-Aug 2020)"
,xaxis_title="Weekday"
,yaxis_title="Number of accidents"
)
fig.show()
# py.plot(fig, filename = 'NYCaccidents-fig4-NumOfAccidentByWeek', auto_open=True)
'https://plotly.com/~melodyyip/7/'
dfg = df[['weekday','location','crash_date']]
dfg = dfg.groupby(['weekday','crash_date']).count().reset_index()
dfg.rename(columns = {"location": "count"}, inplace=True)
dfg = dfg.groupby(['weekday']).mean().reset_index()
dfg
fig = px.bar(dfg, x="weekday", y="count"
# ,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template = "plotly_white"
,category_orders={"weekday": ["Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday","Sunday"]
,"month_": ["January", "February", "March", "April","May","June","July","August"]
}
)
fig.update_layout(
title="Average number of accidents by Weekday (Jan-Aug 2020)"
,xaxis_title="Weekday"
,yaxis_title="Average number of accidents per day"
)
fig.show()
# py.plot(fig, filename = 'NYCaccidents-fig5-AvgNumOfAccidentByWeek', auto_open=True)
'https://plotly.com/~melodyyip/9/'
As shown in the plot above, the number of car accidents decreases at the weekend. Weekdays average around 280-300 car accidents per day, roughly 40 more per day than weekends.
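The two-step aggregation used above (count accidents per calendar day, then average those daily counts per weekday) can be verified on a toy log; the dates below are made up for illustration:

```python
import pandas as pd

# Toy collision log: one row per accident
toy = pd.DataFrame({"crash_date": pd.to_datetime(
    ["2020-01-06", "2020-01-06", "2020-01-13",   # Mondays (2 + 1 accidents)
     "2020-01-11", "2020-01-18"])})              # Saturdays (1 + 1 accidents)
toy["weekday"] = toy["crash_date"].dt.day_name()

# Step 1: accidents per calendar day; step 2: average per weekday
per_day = toy.groupby(["weekday", "crash_date"]).size().reset_index(name="count")
avg = per_day.groupby("weekday")["count"].mean()
print(avg["Monday"], avg["Saturday"])  # 1.5 1.0
```

Averaging daily counts (rather than summing raw rows) is what makes weekdays and weekends comparable even though the dataset covers a different number of each.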
The data we are analyzing contains information related to (1) victims, (2) contributing factors, and (3) vehicles. For each collision, the data frame records the number of people injured and killed, the contributing factor of the accident, and the types of vehicles involved.
Since the data frame includes how many victims were injured and killed in each car accident, we can easily represent the percentage of injuries and deaths using a pie plot as follows:
import plotly.express as px
df['number_of_injured']= df['number_of_persons_injured']
df.loc[df['number_of_persons_injured'] > 2, 'number_of_injured'] = '3-5'
df.loc[df['number_of_persons_injured'] > 5, 'number_of_injured'] = 'More than 5'
fig = px.pie(df, values='accident_count', names='number_of_injured', title='Injuries in 2020',color_discrete_sequence=px.colors.diverging.PRGn)
fig.show()
# py.plot(fig, filename = 'NYCaccidents-fig6-Injuries2020', auto_open=True)
C:\anaconda3\anaconda3\lib\site-packages\chart_studio\plotly\plotly.py:222: UserWarning: Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions: (1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph. (2) Trying using the image API to return an image instead of a graph URL (3) Use matplotlib (4) See if you can create your visualization with fewer data points If the visualization you're using aggregates points (e.g., box plot, histogram, etc.) you can disregard this warning.
'https://plotly.com/~melodyyip/11/'
The plot shows that 73% of the accidents had no victims injured, while 21% had one victim injured.
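These pie shares can be cross-checked directly with `value_counts(normalize=True)`; a minimal sketch on made-up injury counts (the toy values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy injury counts, one value per accident (illustrative)
injured = pd.Series([0, 0, 0, 1, 1, 2, 0, 0, 1, 0])

# Fraction of accidents in each injury bucket, as percentages
shares = injured.value_counts(normalize=True).sort_index() * 100
print(shares.to_dict())
```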
import plotly.express as px
df['number_of_killed']= df['number_of_persons_killed']
df.loc[df['number_of_persons_killed'] > 2, 'number_of_killed'] = '3-5'
df.loc[df['number_of_persons_killed'] > 5, 'number_of_killed'] = 'More than 5'
fig = px.pie(df, values='accident_count', names='number_of_killed', title='Killed in 2020',color_discrete_sequence=px.colors.diverging.PRGn)
fig.show()
# py.plot(fig, filename = 'NYCaccidents-fig7-Killed2020', auto_open=True)
'https://plotly.com/~melodyyip/13/'
Fewer than 1% of the accidents resulted in someone being killed.
taba = pd.crosstab(df.on_street_name, df.month_, values=df.location, aggfunc='count').round(0)
taba = taba.sort_values(by='January',ascending=False)
tabb = pd.crosstab(df.on_street_name, df.month_, values=df.location, aggfunc='count',normalize='columns').round(4)*100
tab = (
pd.concat([taba, tabb],axis = 1, keys = ['count', '%'])
.swaplevel(axis = 1)
.sort_index(axis = 1, ascending=[True, False])
.rename_axis(['month_', 'on_street_name'], axis = 1)
)[['January', 'February', 'March', 'April', 'May', 'June', 'July','August']]
tab.head(10)
| month_ | January | February | March | April | May | June | July | August | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| on_street_name | count | % | count | % | count | % | count | % | count | % | count | % | count | % | count | % |
| on_street_name | ||||||||||||||||
| BELT PARKWAY | 166.0 | 1.71 | 209.0 | 2.25 | 170.0 | 2.29 | 68.0 | 2.48 | 103.0 | 2.54 | 120.0 | 2.39 | 145.0 | 2.44 | 137.0 | 2.42 |
| BROOKLYN QUEENS EXPRESSWAY | 141.0 | 1.45 | 129.0 | 1.39 | 115.0 | 1.55 | 40.0 | 1.46 | 51.0 | 1.26 | 65.0 | 1.29 | 65.0 | 1.09 | 88.0 | 1.55 |
| GRAND CENTRAL PKWY | 128.0 | 1.32 | 114.0 | 1.23 | 87.0 | 1.17 | 29.0 | 1.06 | 38.0 | 0.94 | 49.0 | 0.97 | 71.0 | 1.19 | 48.0 | 0.85 |
| FDR DRIVE | 125.0 | 1.29 | 108.0 | 1.16 | 129.0 | 1.74 | 77.0 | 2.81 | 65.0 | 1.60 | 48.0 | 0.95 | 62.0 | 1.04 | 72.0 | 1.27 |
| LONG ISLAND EXPRESSWAY | 125.0 | 1.29 | 137.0 | 1.47 | 106.0 | 1.43 | 42.0 | 1.53 | 46.0 | 1.13 | 76.0 | 1.51 | 93.0 | 1.56 | 78.0 | 1.38 |
| MAJOR DEEGAN EXPRESSWAY | 111.0 | 1.14 | 111.0 | 1.19 | 79.0 | 1.07 | 38.0 | 1.39 | 47.0 | 1.16 | 51.0 | 1.01 | 44.0 | 0.74 | 73.0 | 1.29 |
| BROADWAY | 101.0 | 1.04 | 103.0 | 1.11 | 80.0 | 1.08 | 27.0 | 0.99 | 32.0 | 0.79 | 70.0 | 1.39 | 73.0 | 1.23 | 73.0 | 1.29 |
| ATLANTIC AVENUE | 95.0 | 0.98 | 101.0 | 1.09 | 84.0 | 1.13 | 31.0 | 1.13 | 31.0 | 0.76 | 55.0 | 1.09 | 61.0 | 1.02 | 54.0 | 0.95 |
| CROSS BRONX EXPY | 89.0 | 0.92 | 83.0 | 0.89 | 85.0 | 1.15 | 24.0 | 0.88 | 54.0 | 1.33 | 48.0 | 0.95 | 51.0 | 0.86 | 72.0 | 1.27 |
| 3 AVENUE | 86.0 | 0.89 | 81.0 | 0.87 | 80.0 | 1.08 | 16.0 | 0.58 | 27.0 | 0.66 | 48.0 | 0.95 | 43.0 | 0.72 | 37.0 | 0.65 |
taba = pd.crosstab(df.contributing_factor_vehicle_1, df.month_, values=df.location, aggfunc='count').round(0)
taba = taba.sort_values(by='January',ascending=False)
tabb = pd.crosstab(df.contributing_factor_vehicle_1, df.month_, values=df.location, aggfunc='count',normalize='columns').round(4)*100
tab = (
pd.concat([taba, tabb],axis = 1, keys = ['count', '%'])
.swaplevel(axis = 1)
.sort_index(axis = 1, ascending=[True, False])
.rename_axis(['month_', 'contributing_factor_vehicle_1'], axis = 1)
)[['January', 'February', 'March', 'April', 'May', 'June', 'July','August']]
tab.head(10)
| month_ | January | February | March | April | May | June | July | August | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| contributing_factor_vehicle_1 | count | % | count | % | count | % | count | % | count | % | count | % | count | % | count | % |
| contributing_factor_vehicle_1 | ||||||||||||||||
| Driver Inattention/Distraction | 3398.0 | 25.93 | 3387.0 | 27.10 | 2803.0 | 27.54 | 919.0 | 24.36 | 1367.0 | 24.19 | 1738.0 | 24.91 | 2152.0 | 25.48 | 2042.0 | 25.46 |
| Unspecified | 3335.0 | 25.45 | 3074.0 | 24.60 | 2523.0 | 24.79 | 1021.0 | 27.06 | 1530.0 | 27.07 | 1929.0 | 27.65 | 2414.0 | 28.58 | 2323.0 | 28.97 |
| Failure to Yield Right-of-Way | 1002.0 | 7.65 | 944.0 | 7.55 | 663.0 | 6.51 | 203.0 | 5.38 | 314.0 | 5.56 | 420.0 | 6.02 | 509.0 | 6.03 | 477.0 | 5.95 |
| Following Too Closely | 961.0 | 7.33 | 938.0 | 7.51 | 754.0 | 7.41 | 150.0 | 3.98 | 310.0 | 5.49 | 384.0 | 5.50 | 463.0 | 5.48 | 420.0 | 5.24 |
| Backing Unsafely | 588.0 | 4.49 | 559.0 | 4.47 | 456.0 | 4.48 | 153.0 | 4.06 | 200.0 | 3.54 | 234.0 | 3.35 | 265.0 | 3.14 | 272.0 | 3.39 |
| Passing or Lane Usage Improper | 530.0 | 4.04 | 504.0 | 4.03 | 379.0 | 3.72 | 124.0 | 3.29 | 188.0 | 3.33 | 265.0 | 3.80 | 321.0 | 3.80 | 288.0 | 3.59 |
| Passing Too Closely | 528.0 | 4.03 | 497.0 | 3.98 | 390.0 | 3.83 | 113.0 | 2.99 | 194.0 | 3.43 | 235.0 | 3.37 | 294.0 | 3.48 | 307.0 | 3.83 |
| Other Vehicular | 378.0 | 2.88 | 364.0 | 2.91 | 242.0 | 2.38 | 109.0 | 2.89 | 182.0 | 3.22 | 230.0 | 3.30 | 275.0 | 3.26 | 250.0 | 3.12 |
| Unsafe Lane Changing | 338.0 | 2.58 | 348.0 | 2.78 | 260.0 | 2.55 | 57.0 | 1.51 | 107.0 | 1.89 | 126.0 | 1.81 | 151.0 | 1.79 | 149.0 | 1.86 |
| Turning Improperly | 247.0 | 1.88 | 251.0 | 2.01 | 219.0 | 2.15 | 59.0 | 1.56 | 100.0 | 1.77 | 117.0 | 1.68 | 162.0 | 1.92 | 148.0 | 1.85 |
taba = pd.crosstab(df.vehicle_type_code_1, df.month_, values=df.location, aggfunc='count').round(0)
taba = taba.sort_values(by='January',ascending=False)
tabb = pd.crosstab(df.vehicle_type_code_1, df.month_, values=df.location, aggfunc='count',normalize='columns').round(4)*100
tab = (
pd.concat([taba, tabb],axis = 1, keys = ['count', '%'])
.swaplevel(axis = 1)
.sort_index(axis = 1, ascending=[True, False])
.rename_axis(['month_', 'vehicle_type_code_1'], axis = 1)
)[['January', 'February', 'March', 'April', 'May', 'June', 'July','August']]
tab.head(10)
| month_ | January | February | March | April | May | June | July | August | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vehicle_type_code_1 | count | % | count | % | count | % | count | % | count | % | count | % | count | % | count | % |
| vehicle_type_code_1 | ||||||||||||||||
| Sedan | 5678.0 | 43.55 | 5482.0 | 44.08 | 4537.0 | 44.83 | 1887.0 | 50.37 | 2725.0 | 48.50 | 3295.0 | 47.47 | 4028.0 | 47.97 | 3876.0 | 48.60 |
| Station Wagon/Sport Utility Vehicle | 5265.0 | 40.38 | 4794.0 | 38.55 | 3877.0 | 38.31 | 1324.0 | 35.34 | 2022.0 | 35.99 | 2459.0 | 35.43 | 2858.0 | 34.04 | 2817.0 | 35.32 |
| Taxi | 596.0 | 4.57 | 672.0 | 5.40 | 455.0 | 4.50 | 86.0 | 2.30 | 133.0 | 2.37 | 172.0 | 2.48 | 239.0 | 2.85 | 224.0 | 2.81 |
| Pick-up Truck | 375.0 | 2.88 | 348.0 | 2.80 | 299.0 | 2.95 | 94.0 | 2.51 | 102.0 | 1.82 | 137.0 | 1.97 | 184.0 | 2.19 | 181.0 | 2.27 |
| Box Truck | 249.0 | 1.91 | 250.0 | 2.01 | 199.0 | 1.97 | 63.0 | 1.68 | 118.0 | 2.10 | 132.0 | 1.90 | 153.0 | 1.82 | 135.0 | 1.69 |
| Bus | 249.0 | 1.91 | 225.0 | 1.81 | 152.0 | 1.50 | 25.0 | 0.67 | 48.0 | 0.85 | 49.0 | 0.71 | 74.0 | 0.88 | 73.0 | 0.92 |
| Tractor Truck Diesel | 87.0 | 0.67 | 104.0 | 0.84 | 71.0 | 0.70 | 27.0 | 0.72 | 46.0 | 0.82 | 56.0 | 0.81 | 59.0 | 0.70 | 61.0 | 0.76 |
| Van | 58.0 | 0.44 | 70.0 | 0.56 | 55.0 | 0.54 | 20.0 | 0.53 | 35.0 | 0.62 | 54.0 | 0.78 | 59.0 | 0.70 | 39.0 | 0.49 |
| Bike | 57.0 | 0.44 | 57.0 | 0.46 | 75.0 | 0.74 | 37.0 | 0.99 | 94.0 | 1.67 | 145.0 | 2.09 | 163.0 | 1.94 | 156.0 | 1.96 |
| Ambulance | 43.0 | 0.33 | 49.0 | 0.39 | 41.0 | 0.41 | 24.0 | 0.64 | 36.0 | 0.64 | 27.0 | 0.39 | 53.0 | 0.63 | 34.0 | 0.43 |
(1) In most accidents, a Sedan, Station Wagon/Sport Utility Vehicle, or Taxi was involved; nearly half of all accidents involved a Sedan.
(2) Most accidents in 2020 had no victims injured (73%).
(3) Accidents concentrate in the afternoon and evening hours, and there are fewer accidents on weekends.
(4) Driver Inattention/Distraction is the main contributing factor (around 25% of accidents).
The best way to analyze spatial data is by using maps. Folium is a Python library that helps you create several types of Leaflet maps. We can easily generate a map of New York City by creating a Folium Map object. The location argument centers the map on a specific point (in our case, New York City), and the zoom_start argument sets the initial zoom level.
import folium
df['coords'] = None
df['lon'] = None
df['lat'] = None
# we will use regex to parse the accident coordinates out of the 'location'
# strings, e.g. "POINT (-73.83376 40.8921)"
import re
for i, row in df.iterrows():
    # extract the part of the string inside the parentheses
    match = re.search(r'\((.*?)\)', row['location'])
    if match is None:
        continue
    # split into [lon, lat] and round to 4 decimal places
    lon, lat = [round(float(v), 4) for v in match.group(1).split(' ')]
    # save per-row with .at (a plain df['lon'] = ... would overwrite the whole column)
    df.at[i, 'lon'] = lon
    df.at[i, 'lat'] = lat
    # folium expects coordinates in [lat, lon] order
    df.at[i, 'coords'] = [lat, lon]
coords_lst = [c for c in df['coords'].tolist() if c]
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import folium
from folium import plugins
accident_map = folium.Map(
    location=[df['latitude'].mean(), df['longitude'].mean()],
    zoom_start = 15,
    prefer_canvas = True,
    disable_3d = True
)
accident_map.add_child(
    plugins.HeatMap(
        data = coords_lst,
        radius = 15
    )
)
# accident_map.save("map_output.html")
from folium.plugins import HeatMapWithTime
# Create a map object centered on New York City
df_map = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=12)
# Nested list containing the latitude and longitude of the accidents for each month
month_list = [[] for _ in range(9)]
for lat, lon, month in zip(df.latitude, df.longitude, df.month):
    month_list[month].append([lat, lon])
# Labels indicating the months (index 0 is unused since months start at 1)
index = ['0','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug']
# Instantiate a heat map with time object for the car accidents
HeatMapWithTime(month_list, index).add_to(df_map)
df_map
Looking at the timeline above, we can observe how the number of accidents per month evolves: a sharp drop in April, followed by a gradual recovery through the summer.
Thanks for reading!!! 😊 🍀